Context Driven Technique for Document Classification

Authors

  • Upasana Pandey
  • Rahul Jain
Abstract

In this paper we present an innovative hybrid Text Classification (TC) system that bridges the gap between statistical and context-based techniques. Our algorithm harnesses contextual information at two stages. First, it extracts a cohesive set of keywords for each category using lexical references, implicit context derived from Latent Semantic Analysis (LSA), and word-vicinity-driven semantics. Second, each document is represented by a set of context-rich features whose values are derived by considering both lexical cohesion and the extent to which salient concepts are covered, as measured via lexical chaining. After the keywords are extracted, a subset of the input documents is set aside as the training set, and its members are assigned categories based on their keyword representation. These labeled documents are used to train binary SVM classifiers, one for each category. The remaining documents are supplied to the trained classifiers in the form of their context-enhanced feature vectors, and each document is finally assigned its category by an SVM classifier.
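The pipeline described in the abstract (keyword-based labeling of a training subset, one binary SVM per category, then classification of the remaining documents) can be illustrated with a minimal pure-Python sketch. Everything concrete here is an assumption made for illustration, not the authors' implementation: the `KEYWORDS` and `STOPWORDS` sets are hand-written, plain term counts stand in for the context-enhanced features, the linear SVM is trained by simple subgradient descent without a bias term, and the final category is taken as the highest-scoring binary classifier.

```python
from collections import Counter

# Hypothetical keyword sets. The paper derives these from lexical references,
# LSA, and word-vicinity semantics; here they are simply hand-written.
KEYWORDS = {
    "sports": {"match", "team", "goal", "player"},
    "finance": {"bank", "market", "stock", "loan"},
}
STOPWORDS = {"the", "a", "in", "as"}  # illustrative, not from the paper

TRAIN_DOCS = [
    "the team scored a goal in the match",
    "the player left the team",
    "the bank raised the loan rate",
    "stock market fell as the bank warned",
]

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords."""
    return [t for t in text.lower().split() if t not in STOPWORDS]

def keyword_label(text, keywords=KEYWORDS):
    """Return the category whose keywords occur most often, or None."""
    counts = Counter(tokenize(text))
    scores = {c: sum(counts[w] for w in kws) for c, kws in keywords.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None

def featurize(text, vocab):
    """Plain term counts; a stand-in for the paper's context-rich features."""
    counts = Counter(tokenize(text))
    return [float(counts[w]) for w in vocab]

def dot(w, x):
    return sum(wj * xj for wj, xj in zip(w, x))

def train_binary_svm(X, y, epochs=50, lr=0.1, lam=0.01):
    """Linear SVM (no bias term) via subgradient descent on the
    L2-regularised hinge loss; labels in y are +1 or -1."""
    w = [0.0] * len(X[0])
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            violated = yi * dot(w, xi) < 1          # inside the margin?
            w = [wj * (1 - lr * lam) for wj in w]   # regularisation shrink
            if violated:
                w = [wj + lr * yi * xj for wj, xj in zip(w, xi)]
    return w

def train_one_vs_rest(docs, keywords=KEYWORDS):
    """Label docs by keywords, then train one binary SVM per category."""
    labeled = [(d, keyword_label(d, keywords)) for d in docs]
    labeled = [(d, c) for d, c in labeled if c is not None]
    vocab = sorted({t for d, _ in labeled for t in tokenize(d)})
    X = [featurize(d, vocab) for d, _ in labeled]
    models = {}
    for cat in keywords:
        y = [1 if c == cat else -1 for _, c in labeled]
        models[cat] = train_binary_svm(X, y)
    return models, vocab

def classify(text, models, vocab):
    """Pick the category whose binary classifier scores highest (assumed rule)."""
    x = featurize(text, vocab)
    return max(models, key=lambda c: dot(models[c], x))

models, vocab = train_one_vs_rest(TRAIN_DOCS)
print(classify("team won the match", models, vocab))
```

In this toy setup the keyword labels play the role of the automatically assigned training categories, so no manually labeled data is needed. Swapping `featurize` for lexical-chain-based features would bring the sketch closer to the approach the abstract describes.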

Publication date: 2011